Evaluating AI Usage for Evaluation Purposes

Improving Report Summarization

“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”

Edward Osborne Wilson

Current Challenge

Given the number of evaluation reports published across the UN system, challenges have arisen in information retrieval and evidence generalization.

How can the most relevant findings and recommendations be extracted from one specific context, then reused and re-injected in a different but appropriate one?

The fifth wave of the evidence revolution will be triggered by AI

Having human beings scan articles for relevant text for inclusion is likely a very inefficient way to produce reviews. Adopting these technologies will improve the speed and accuracy of evidence synthesis.

The four waves of the evidence revolution, published in Nature, Howard White, 2019

Results Cherry-Picking: how to build an effective “Evaluation Brief”?

Summarization means choosing what to include and what to exclude: highlighting critical aspects while deciding which less relevant details to omit.

Relying on automated retrieval can help improve the objectivity and independence of evaluation report summarization.

Cassandra, bearer of bad news

RAG to the Rescue!

Retrieval Augmented Generation (RAG) combines the strengths of retrieval-based large language models & generative large language models.
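The combination can be sketched end to end in a few lines. The Python below is a minimal, illustrative sketch only: a toy bag-of-words retriever stands in for a real embedding model, and `generate()` is a placeholder for a call to a generative LLM.

```python
# Minimal RAG sketch: retrieve the most similar chunks, then hand them
# to a (placeholder) generator. Not production code.
from collections import Counter
import math

def embed(text):
    # Toy "embedding": bag-of-words counts. A real pipeline would use a
    # model such as bge-large-en-v1.5 instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(query, context):
    # Placeholder for a generative LLM call (e.g. Command-R or Mixtral):
    # here we simply stitch the retrieved evidence into a templated answer.
    return f"Based on: {' | '.join(context)}"

chunks = [
    "The evaluation found gaps in data protection training.",
    "Budget execution improved after 2018.",
    "Information management roles remained unclear at field level.",
]
answer = generate("What did the evaluation find about data?",
                  retrieve("data protection training gaps", chunks))
```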

Leaderboard for Large Language Models

The Hugging Face Hub hosts leaderboards that rank and compare the performance of large language models (LLMs) on various benchmarks and tasks.

It includes leaderboards for both embedding and generation models that:

  • Provides a clear and transparent comparison of different LLMs.
  • Helps identify the best models for specific tasks or domains.

Building a RAG Pipeline requires:

  1. Data Collection: Select & Gather relevant reports.

  2. Model Testing: Test different generative and retrieval large language models.

  3. Integration: Combine models & functions into a cohesive pipeline.

  4. Validation: Build a human-baseline to benchmark the performance of the integrated system.

  5. Evaluation: Assess accuracy, relevance, and efficiency using predefined metrics.
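As an illustration of the preprocessing behind step 1, one common (hypothetical) choice is fixed-size chunking with overlap before embedding; the size and overlap values below are placeholders.

```python
# Fixed-size character chunking with overlap: each window shares its
# tail with the head of the next, so no sentence is lost at a boundary.
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

report = "A" * 500
pieces = chunk_text(report, size=200, overlap=50)
```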

A RAG Evaluation Framework

Define and apply relevant metrics for both retrieval and generation to systematically & continuously assess the performance of the pipeline against existing models and baselines.
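On the retrieval side, two standard metrics can be defined directly; here the relevance labels are assumed to come from a human-built baseline such as the one described later.

```python
# precision@k: share of the top-k retrieved chunks that are relevant.
# recall@k: share of all relevant chunks found within the top k.
def precision_at_k(retrieved, relevant, k):
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

retrieved = ["c3", "c7", "c1", "c9"]   # ids returned by the retriever
relevant = {"c1", "c3"}                # ids labelled relevant by humans
```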

Applying a “Data Science” Approach!

Thorough Documentation

Keep detailed records of data sources, preprocessing steps, model configurations, and evaluation results, along with clear guidelines on usage and troubleshooting.

Reproducible Workflows

Ensure that experiments can be replicated by others:

  • Keep code under version control.
  • Automate pipelines and scripts for data processing, model training, and evaluation.
  • Share public repositories for collaborative work.

Transparent Reporting

Clearly communicate methodologies and findings in reports and publications:

  • Type of Chunking
  • Name of Embedding
  • Retrieval Strategy
  • Name of Response LLM
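One lightweight way to make these four items reportable is a run manifest recorded alongside each experiment. The field names below are illustrative; the embedding and response-model values mirror the experiment reported in this document, while the chunking numbers are placeholders.

```python
# Illustrative run manifest covering the four reporting items above.
rag_run_manifest = {
    "chunking": {"type": "fixed-size", "size": 512, "overlap": 64},  # placeholder values
    "embedding_model": "bge-large-en-v1.5",
    "retrieval_strategy": "dense top-k cosine similarity",
    "response_llm": "Command-R",
}
```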

Organising Validation with Human Feedback Loop

Incorporate ongoing feedback from users to continuously improve model performance.

  • Task-Specific Fine-Tuning: Adjust models based on specific application requirements and domain knowledge.
  • Alignment Fine-Tuning: Ensure that model outputs align with ethical guidelines and user expectations.

Experimentation Results

See full article here

  1. Report used: the 2019 Evaluation of UNHCR’s data use and information management approaches.

  2. Models Tested: small large language models that can run on a powerful laptop: Command-R & Mixtral for generation, bge-large-en-v1.5 for the embeddings.

  3. Integration: Use of LangChain for the orchestration

  4. Validation: Human-baseline generated with labelStud.io.

  5. Evaluation: Assess accuracy, relevance, and efficiency using RAGAS (Retrieval Augmented Generation Assessment).
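RAGAS scores dimensions such as faithfulness with LLM judges, which cannot be shown self-contained here. As an illustration only, below is a crude lexical proxy for faithfulness: the share of answer sentences whose content words all appear in the retrieved context. This is not the RAGAS metric, just a sketch of the idea behind it.

```python
# Crude stand-in for a faithfulness check: a sentence counts as
# "supported" if every word in it also occurs in the context.
import re

def faithfulness_proxy(answer, context):
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        1 for s in sentences
        if (words := set(re.findall(r"[a-z]+", s.lower()))) and words <= context_words
    )
    return supported / len(sentences)

context = "The evaluation found gaps in data protection training at field level."
good = "The evaluation found gaps in training."
bad = "The evaluation praised the budget process."
```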

AI Deployment: Buy or Build?

Some Considerations

  • Total Cost of Ownership: Off-the-shelf “production-level” solutions do not exist. The real challenge is to balance outsourcing and insourcing correctly.
  • Modular Customization: The “orchestration” solution should be flexible enough to adapt to new developments without rebuilding everything.
  • Agility - Iterate & Deliver: Adopt short development rounds to test with users.
  • Expertise & Training: Nurture in-house awareness and expertise to understand how RAG works, to test it, and then to help build validation datasets.